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Discriminative latent variable models (LVM) are frequently applied to various visual 
recognition tasks. In these systems the latent (hidden) variables provide a formalism for 
modeling structured variation of visual features. Conventionally, latent variables are de¬ 
fined on the variation of the foreground (positive) class. In this work we augment LVMs 
to include negative latent variables corresponding to the background class. We formalize 
the scoring function of such a generalized LYM (GLVM). Then we discuss a framework 
for learning a model based on the GLVM scoring function. We theoretically showcase 
how some of the current visual recognition methods can benefit from this generalization. 
Finally, we experiment on a generalized form of Deformable Part Models with negative 
latent variables and show significant improvements on two different detection tasks. 


1 Introduction 

discriminative latent variable models (LVM) are frequently and successfully applied to var¬ 
ious visual recognition tasks [ 0 , □, O, □, O, O, US] . In these systems the latent (hidden) 
variables provide a formalism for modeling structured variation of visual features. These 
variations can be the result of different possible object locations, deformations, viewpoint, 
subcategories, etc. Conventionally, these have all been defined only based on the foreground 
(positive) class. We call such latent variables “positive”. In this work we introduce general¬ 
ized discriminative LVM (GLVM) which use both “positive” and “negative” latent variables. 
Negative latent variables are defined on the background (negative) class and provide counter 
evidence for presence of the foreground class. They can, for instance, learn mutual exclu¬ 
sion constraints, model scene subcategories where the positive object class is unlikely to be 
found, or capture specific parts which potentially indicate the presence of an object of a sim¬ 
ilar but not the same class. 


The objective of the proposed framework is to learn a model which maximizes the evidence 
(characterized by positive latent variables) for the foreground class and at the same time min¬ 
imizes the counter evidence (characterized by negative latent variables) lying in background 
class. Thereby GLVM empowers the latent variable modeling by highlighting the role of 
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negative data in its own right. An interesting analogy is with game theory; when playing a 
game a simple strategy is to maximize your score. This resembles LVM’s objective which is 
to maximize evidence for the presence of the foreground class. A better strategy, however, 
is to maximize your scores at the same time as minimizing the opponents’ scores. This is 
analogous to GLVM’s objective which takes into account counter evidence emerging from 
negative latent variables while looking for evidence from the foreground class. The score of 
the opponents is counter evidence in our hypothetical example. 

The concept of negative parts was noted in [O]. However, [O] focuses on automatic dis¬ 
covery and optimization of a part based model with negative parts. In this paper we extend 
the notion of negative parts to negative latent variables and propose a framework for defining 
models that use both positive and negative latent variables. 

Many different modeling techniques benefit from latent variables. Felzenszwalb et al. [E] 
introduced latent SVM to train deformable part models, a state of the art object detector [□]. 
Yu and Joachims m transferred the idea to structural SVMs. Yang et al. [O] proposed a 
kenelized version of latent SVM and latent structural SVM and applied it to various visual 
recognition tasks. Razavi et al. [O] introduced latent variables to the Hough transform and 
achieved significant improvements over a Hough transform object detection baseline. And- 
Or trees use latent variable modeling for object recognition and occlusion reasoning [D3]. 

In this paper we introduce a novel family of models which generalizes discriminative latent 
variable models. We review LVMs in Section 2, formulate our generalization of LVMs in 
Section 3. Then, we discuss training of GLVMs and propose an algorithm for learning the 
parameters of a GLVM in Section 4. In Section 6 we discuss how various computer vision 
models can benefit from GLVM. Finally, in Section 7 we experiment on generalized DPMs. 

2 Discriminative Latent Variable Models (LVM) 

Consider a supervised classification problem provided with a set of n training examples 
X G 7Z dxn and their corresponding labels Y G y n . The goal of discriminative learning is then 
to learn a scoring function S w (x,y), parametrized by w, that ranks a data point Xi and its true 
label yt higher than the rest of labels. For a linear scoring function this is written as follows: 

Sw(xi,yi) > S w (xi,y) Vy ^ y t (1) 

S w (xi,y)=w T <f)(xi,y) (2) 

Here (j>(x,y) is the representation function (e.g. Histogram of Oriented Gradients, ConvNet 
representation, etc.). Since the scoring function of (2) is linear in w any invariance to non¬ 
linear object class deformation should be reflected in the representation function (j). This 
requires a very rich feature encoding. LVMs model known structured variations of the ob¬ 
ject class by using various latent variables a corresponding to different sources of variation. 
Each latent variable models a source of variation and accordingly alters the representation 
(/) so that it best matches the model parameters w. Thus, all latent variables in an LVM are 
maximized over. For instance, in weakly supervised object detection location of the object 
in an image is modeled by a latent variable. The scoring function of an LVM takes this form: 


Sw(*hy) =ma xw T (j)(xi,y : a) 

a 


(3) 
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Figure 1: Example of an LVM scoring function on three latent variables (left), a Simple GLVM scoring function (middle), and a GLVM in its 
canonical form. The models become more and more complex from left to right. 

In some cases the set of labels y comprises of only known classes. In open-set classification, 
however, one of the possible labels in y refers to an unknown class. The simplest form of 
open-set classification is binary classification where samples are classified into the class of 
interest where y = +1 ( e.g . cat) or anything else where y= —1 ( e.g . non-cat) which are so 
called foreground and background respectively. In this work we are interested in open-set 
classification problems which suit many real-world applications better. In the next section 
we show how the latent variables in the scoring function of (3) can be altered to “model” 
the negative data in its own right and thereby enrich the expressive power of discriminative 
latent variable models. In the rest of the paper we focus on the binary classification problem. 

3 Generalized LVM with Negative Modeling (GLVM) 

The scoring function in discriminative LVM is usually constructed in such a way that it looks 
for “evidence” for the foreground class, and the weakness (or lack) of such evidence indi¬ 
cates that the input is from the background class. For instance, in detecting office scenes, the 
location of a work desk (evidence for office) can be modeled by a latent variable. 

We generalize LVMs by equipping them with latent variables that look for “counter evi¬ 
dence” for the foreground class. For example, detecting a stove (evidence for kitchen) or 
sofa (evidence for living room) provides counter evidence for office. Such latent variables 
can also be contextual, for example for a cow detector, presence of saddle/rider/aeroplane 
counts as counter evidence whereas presence of meadow/stable/milking parlor counts as ev¬ 
idence. For visual examples refer to Section 1 in supplementary material. 

We call the latent variables that collect evidence for the foreground class positive latent vari¬ 
ables and the latent variables that collect counter evidence for the foreground class negative 
latent variables and denote them by a and b respectively throughout the paper. 

Representation function. Before we go further, let’s explain how our representation func¬ 
tion (j) is constructed. We assume the model parameters (associated with all latent variables) 
are concatenated into a single vector w. Moreover, w and its corresponding feature vector 
••■) have the same dimensionality. Every latent variable a^ or bk encodes 
its own place in (j>(). The places for different latent variables do not overlap. When some la¬ 
tent variables are omitted from the inputs of 0(), their place is filled with zero vectors. These 
assumption are made for the ease of notation in the upcoming formulations. For clarity we 
take three steps towards defining the generic form of GLVM scoring function. 

Simple Form. The simplest form of a GLVM scoring function involving two independent 
sets of positive and negative latent variables is (Z is the set of possible latent assignments): 

S w (xi,y) = max w T <f){x u y,a) - ma x w T </>(x ii y,b) 
aeZ+ beZ~ 


(4) 
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Note that the second max cannot be absorbed in the first max. Thus, the GLVM scoring 
function is more general than that of LVM. In particular, (4) reduces to (3) when \Z~\ < 1. 
One can characterize positive and negative latent variables according to the way they impact 
the value of the scoring function. Each assignment of a latent variable has a corresponding 
value associated to it. For instance, in the deformable part models (DPMs) latent variables 
indicate the location of the object parts [E]. Each assignment of a latent variable in a DPM 
corresponds to a placement of a part which in turn is associated with a value in the score map 
of the part. A positive (negative) latent variable is one whose associated value is positively 
(negatively) correlated with the final value of the scoring function. This should not be con¬ 
fused with negative entries in the model vector w. In a linear model with the score function 
S{x) = w T (j)(x ) the value of the d- th dimension of the feature vector {i.e. (j>{x)d) is negatively 
correlated with S{x) when wj < 0. Yet, negative latent variables are fundamentally different 
from this. In particular, a negative latent variable cannot be obtained by flipping the sign in 
some dimensions of the parameter vector and using a positive latent variable instead. The 
value associated with a negative latent variable {i.e. max^w 7 <j>(x,b)) is a convex function of 
w whereas the value associated to negative entries in w {i.e. wj^(v)j) is linear in w. 

Dependent Form. The value of the GLVM score can be thought of as the value of the best 
strategy in a 2-player game. To show this we need to modify the notation as follows: 

S w (xi) = max w T (/)(xi,a) + min —w r 0(x;,&) 
aeZ+ bez~ 

— max min w T (f){xi,a) —w T (f){xi,b) 

aez+bez- 

= max min w T </){xi,a,b) (5) 

aez+bez- 

In the first step we dropped y from the notation since we will be focusing on binary clas¬ 
sification. In the second step min is pulled back. For the last step we let (j){xi,a,b) = 
(/){xi,a) — (j)(xi,b) although any feature function that depends on both b and a would do. So, 
(5) is more general than (4). Note that the game theory analogy of minimax is now more 
apparent in the new formulation of (5). 

Canonical Form. The scoring function in (5) is linear when the latent variables a and b 
are fixed. One can imagine augmenting this linear function with more positive and nega¬ 
tive latent variables; the same way that we extended (2) to get to (5). This creates multiple 
interleaved negative and positive latent variables. This way, we can make a hierarchy of al¬ 
ternating latent variables recursively. This general formulation has interesting ties to compo¬ 
sitional models, like object grammar [B], and high-level scene understanding. For example, 
we can model scenes according to the presence of some objects (positive latent variable) and 
the absence of some other objects (negative latent variable). Each of those objects them¬ 
selves can be modeled according to presence of some parts (positive) and absence of others 
(negative). We can continue the recursion even further. The GLVM scoring function, after 
being expanded recursively, can be written in the following general form: 

Sw{*i) = maxmin maxmin ... maxmin w T {p (xj, (a^,^)L ) (6) 

a\ b\ «2 &2 a K b K V / 

We call this the canonical form of a GLVM scoring function. Note that any other scoring 
function that is a linear combination of max’s and min’s of linear functions can be converted 
to this canonical form by simple algebraic operations. Moreover, (6) with K > 1 and non 
empty and bk cannot be simplified into (5), in general. Its proof follows the max-min 
inequality [□]. See Figure 1 for an example. 
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4 Training GLVMs for Classification 

In this section we propose an algorithm for training GLVMs to do binary classification. We 
focus on SVM training objective and use hinge loss. That being said, the same idea can be 
used to optimize any training objective that involves a linear combination of convex and/or 
concave functions of GLVM scores; e.g. SVM with squared hinge loss. For the ease of no¬ 
tation we drop the subscript w in S w in the rest of this section. 

Let D = {(Any *)}?= 1 denote a set of n labeled training examples. Each training example is a 
pair (xi,yi) where xt is the input data and y* E {+1, —1} is the ground-truth label associated 
with it. SVM training objective can be written as follows: 

0{w) = ^||w|| 2 + C^max(0,l-y/5(^)) (7) 

z ;=l 

where S(xi) is a scoring function. Training a classifier requires finding the model w* that 
minimizes the function i.e. w* = argmin w 0(w). In most practical applications, however, the 
objective function O is complex (in particular, non-convex). This makes it hard to find the 
global minimizer. Thus, we usually have to resort to local minima of the objective function. 

The concave-convex procedure (CCCP) [O] is a general framework for optimizing non- 
convex functions. For minimization, in each iteration the concave part is replaced by a hy¬ 
perplane tangent to it at the current solution. The result is a convex function. Local minima 
of a convex function are global minima. Thus, one can use simple methods such as gradient 
descent to minimize a convex function. 

We train GLVMs using an iterative algorithm that is very similar to CCCP [□]. But, unlike 
CCCP, we do not linearize the concave part of the objective function. Our training algorithm 
alternates between two steps: 1) constructing a convex upper bound to the training objective 
O and 2) minimizing the bound. Let B t (w) denote the bound constructed at iteration t. Also 
let w t = argmin w B t (w) be the minimizer of the bound. All bounds must touch the objective 
function at the solution of the previous bound; that is B t (w t ~i) = 0(w t -\),\/t. This process 
is guaranteed to converge to a local minimum or a saddle point of the training objective O. 

To construct bounds B t , as we shall see next, we only need to construct a concave lower 
bound S(xi) and a convex upper bound Sfa) to the scoring function of (6). We define B t as 
follows: 


B t (w) = -||w|| 2 + C L max (0,1 +S t (xi))+C £ max(0,1-$(*,■)) (8) 

z i—\ i= l 

^=-1 ^=+1 

In the next section we will explain how S(xj) and S(xi) are constructed for GLVMs. Algo¬ 
rithm 1 in the supplementary material summarizes all steps in our training procedure. 

Convex Upper Bound and Concave Lower Bound. Here we explain how we obtain S t (v,-) 
and S t (xi) when GLVM scoring function (6) is used. We fix all the negative latent variables 
in S(xi) to get S t (xf) and fix all the positive latent variables to get S t (xi). 

Let w t denote the model at iteration t and let a = (dq,... be an arbitrary assignment 
of the positive latent variables. Given a , we denote the fixed values for the negative latent 
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variables on image Xi by frW (a) = {b ^and define it as in (9). Similarly, we define 
a,( l \b) = (a ^\... ,a$) as in (10). 

b {l \a) = argmin wf (j) (x u (a k ,b k ) f =1 ) (9) 

b u ...,b K v 7 

a®(b) = argmax wf 0 fx t , (a k ,b k )f =1 ) (10) 

ai,...,a K v 7 

Note that (9) uses a min and (10) uses a max. Also, note that in both equations the fixed 
assignments are computed according to w t . Finally, we define S t {xi) and S t {xi) as follows: 


S t (xi) = max max ... max w T (j) lXi 1 (a^b^ (a)\ 
a\ a 2 a K \ \ K / 


f 

'k= 1 


St(xj) = min min ... min w 

b\ b 2 b K 




(ID 

( 12 ) 


(11) is convex because it is a max of linear functions and it is an upper bound to S(x) because 
the min functions are replaced with a fixed value. Similarly for (12). 

For discussions regarding the dependency structure of the latent variables and how it affects 
the scoring and inference functions refer to Section 2 in the supplementary material. 


5 Connection to existing Models 

LSSVM: An interesting connection is to latent structural SVMs (LSSVM) [03] where the 
scoring function is given by, 


S(xi,yi) = max w T <f)(xi,yi,h) (13) 

h 

In this form, the formulation of scoring function does not make use of negative latent vari¬ 
ables as defined in GLVM. In LSSVM objective function the individual loss induced by a 
sample (xi,yi) is calculated as below, 

L(*i,yi) = m ax[w r 0(^-,y,/z)+A(y/,y,/z)] — maxw r 0(^,y/,/i) (14) 

(y,h) h 

Note that the representation function takes as input both class y and latent variable h. Now, 
assume a binary classification (y E { — 1,1}) framework with a symmetric loss function A 
independent of h. Then, the scoring and loss function can be reformulated as, 

S r (xi) = maxiv r 0(x/,+l,/i) — maxw r 0(x/, — l,h) (15) 

h h 

L(xi,yt) = max(0, A(+l, —1) - jjS'(*,•)) (16) 

Note that (15) is similar to the shallow structure of GLVMs in (4). This shows how negative 
latent variables can potentially emerge from LSSVMs. 

Mid-level features based recognition: Many recent scene image classification techniques 
pre-train multiple mid level features (parts) [0, i, □] to represent an image. They construct 
the representation of an image by concatenating the maximum response of each part in the 
image and then train a linear classifier on top of those representations. Naderi et al. [O] 
have shown that training a classifier on those representations effectively corresponds to the 
definition of negative parts. In these works, the score of a part at an image is maximized over 
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different locations and thus its position and scale are, in essence, treated as a latent variables. 
In that respect, a negative weight in the linear classifier for such a latent variable can be 
translated to a negative latent variable ( i.e . its maximum score contributes negatively to the 
score of an image). The training procedure for the scene classification models of [0, i, B, O] 
is similar to the shallow version of the GLVM framework proposed in this paper. 

ConvNet: [Of] argues that the kernels in Convolutional Networks (ConvNet) can be inter¬ 
preted as parts in part based models and shows that DPM is effectively a ConvNet. However, 
since ConvNets have the extra ability of learning weights (potentially negative) over these 
multiple kernels it is modeling counter evidences in a similar form as GLVM. 


6 Showcases 

To elicit the implications of the proposed generalization of LVMs, in this section, we review 
a few different state of the art computer vision models and demonstrate how they can benefit 
from the proposed framework. We consider adding negative latent variables to deformable 
part models, And-Or trees, and latent Hough transform and discuss the implications of it. 


Show Case 1: Deformable Part Models. Deformable Part Model (DPM) [0] is an ob¬ 
ject detector that models an object as a collection of deformable parts. DPM uses multiple 
mixture components to model viewpoint variations of the object. The location of the parts 
(denoted by lj) and the choice of the mixture component (indexed by c) are unknown and 
thus treated as latent variables. The scoring function of a DPM with n parts can be written 
as follows: 


S(x) = max V maxw^T(*,//,c) (17) 

C=1 j=l 'J 

We extend DPMs to include negative parts. We refer to this as generalized DPM (GDPM). 
For instance, when detecting cow, positive parts in DPM include head, back, legs of a cow. 
In GDPM one can imagine several candidates for negative parts. For example, since horses 
appear frequently in the high scoring false positives of cow detectors, saddle or horse-head 
may be good negative parts for a cow detector. We define GDPM scoring function as: 

k f n m .. 

S(x)=max ^ ma xw T (j)(x,lj+,c) — ^ maxw T (j)(x,lj-,c)\ (18) 

y‘+=l A j =1 i 

Following the recursion of GLVM, the outer max can also be cloned with a negative coun¬ 
terpart. In that respect, the negative components would look, for instance, for different view¬ 
points of a horse with its own positive and negative parts. We implemented GDPM and 
evaluated it on two different datasets. We discuss the experimental results in Section 7. 

Show Case 2: And-Or trees. And-Or trees m are hierarchical models that have been 
applied to various visual recognition tasks. And-Or trees are represented by a tree-structured 
graph. Edges in the graph represent dependencies and nodes represent three different oper¬ 
ations. Terminal nodes (7}) are directly applied to the input image. A typical example of a 
terminal node is a template applied at a specific location of the image. Or nodes ( Oi ) take the 
maximum value among their children. An Or node can model, for example, the placement of 
terminal nodes or the choice of mixture components. And nodes (A/) aggregate information 
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by summing over their children. The scoring functions of these nodes are defined as follows: 


S(T i \I-w i ) = wJl(T i ), 5(A,-|C) = £ 5(c), 5(0,|C) = max 5(c) (19) 

cec(A,) ceC (°‘) 

C encodes the hierarchy of the nodes and C(N ) denotes children of node N. In order to 
compute the score of an input image the scores are propagated from the terminal nodes to 
the root of hierarchy using dynamic programming. Or nodes find the most plausible inter¬ 
pretation, And nodes aggregate information, and terminal nodes collect positive evidence for 
the foreground class. Generalizing an And-Or tree with negative latent variables introduces 
a new type of node which we call a Nor node (V;). Nor nodes collect counter evidence for 
the class of interest. We define the scoring function of a Nor node as follows: 

S(Ni\C)= min S(c ) (20) 

ceC(Ni ) 


Show Case 3: Latent Hough Transform. Hough transform (HT) has been used as a non- 
parametric voting scheme to localize objects. In Implicit Shape Models [DU] the score of 
a hypothesis h ( e.g . object location and scale) in an image x is calculated by aggregating 
its compliance to each positive training images. Let D p denote the set of positive training 
images. Also let S(x,h\xi) denote the vote of a positive training image Xi G D p in favor of 
hypothesis h. The scoring function of HT is defined as follows: 

S HT {x,h) = £ S(x,h\ Xi ) (21) 

ieDp 

One of the main issues with HT is the aggregation of votes from inconsistent images. For 
instance, training images of side-view cars contribute to the voting in a test image of a frontal- 
view car. Razavi et al. [□] alleviate this issue by introducing latent Hough transform (LHT). 
LHT groups consistent images (e.g. exhibiting same viewpoints) together using a latent vari¬ 
able z G Z. Let (j) (x, h) denote the vector of votes for hypothesis h obtained from all positive 
training images i.e. (j>(x,h)i = S(x,h\xi). LHT assigns weight w Zj t to the positive training 
image x; G D p indicating its relevance to the selected mixture component z. Scoring function 
of an LHT parameterized by w is defined as follows: 

Slht (x, h ) = max wj (j) (x, h) (22) 

zez 

In order to reformulate the problem as a GLVM we need to introduce negative votes [□] cast 
by negative training images in the scoring function of HT. 


Sht (x, h) = Y S(h\xi)- Y s ( h \ x i ) ( 23 ) 

i^Dp iED n 

Then the scoring function of a GLVM generalization of LHT is given by: 

Slnht ( x , h) - max w T 7+ (j) (x, h ) — max w^~ 0 (x, h) (24) 

z+ z~ 

It should be emphasized that negative latent variables might be crucial when using negative 
votes in HT due the high multimodality of negative data distribution. That is negative images 
are probably more susceptible to the aggregation of inconsistent votes. This might explain 
the reason why negative voting is not popular for HT based object recognition. Note that the 
GLVM version of LHT can make use of the negative data in a more effective way. 
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bird cat cow dog horse sheep 

dpm 8 

10,9 16.6 21.9 12.5 56.0 18.0 

GDPM* 

10.9 17.0 22.0 12.5 57.2 19.2 

GDPMg 

9.5 18.6 23.2 12.2 56.0 19.8 

gdpm| 

10.5 13.9 21.7 11.9 54.8 19.1 



(a) AP for animal classes of VOC 2007, com¬ 
paring DPM with GDPM with different number of 
parts. Sub (super) indices indicate the number of 
positive (negative) parts. 


(b) Analysis of false positives of cow using [ 0 ]. DPM (left), GDPM j? (middle) and GDPM j 
(right) with similar part initialization. The volumes show the commulative number of top de¬ 
tections within 5 different categories: true detection (white), localization FP (blue), background 
FP (purple), similar objects (red), other objects (green). Red (dashed red) line is the percentage 
of recall at each number of detections using 50% (10%) overlap threshold. GDPM with careful 
initialization (right figure) slightly reduces confusion with similar objects. 


Figure 2: PASCAL VOC 2007 


7 Experiments 


To demonstrate the efficacy of a GLVM we evaluate it on deformable part models (DPM) by 
augmenting it with negative parts (GDPM). As discussed in Section 4 adding negative parts 
changes the optimization framework of DPMs. This includes many noteworthy details about 
inference, relabeling, data mining, features cache, etc. For details of these modifications re¬ 
fer to Section 4 in supplementary material. We conducted our experiments of GDPM on two 
different datasets. Similar to [0] HOG is used as feature descriptor but the observed trends 
should be independent of this choice. The results are measured using Average Precision of 
the PR-curve. We use the PASCAL VOC criterion of 50% intersection over union for recall. 


PASCAL VOC 2007 Animals. We first evaluate GDPM on animal subset of VOC 2007. 
The training set of VOC 2007 contains about 4500 negative images and a few hundreds 
(~ 400) of positive instances per animal class. The same goes for the test set. We use the 
same hyper-parameters as used for original DPM [0], namely 6 components (3 + 3 mirror 
components) and 8 parts per component. Our results slightly differ from those reported 
in [□] due to absence of post-processing (bounding box prediction and context re-scoring) 
and slight differences in implementation. Figure 2a summarizes the results for substituting 
some positive parts in DPM with negative parts. 4 out of the 6 tested classes benefit from 
replacing positive parts with negative parts. Depending on the class the optimal number of 
negative parts is different. For PR curves refer to Section 6 in supplementary material. 

Initialization. Similar to DPM [0], we initialized the location and appearance of positive 
(negative) parts by finding the sub-patches of the root filter which contained highest positive 
(negative) weights. However, we observed that negative parts in GDPM are quite sensitive 
to initialization. This is due to the non-convexity of the optimization. Furthermore, while 
the positive label corresponds to a single visual class the negative label encompasses more 
classes and thus exhibits a multi-modal distribution. But, we are interested in discrimination 
of the positives, thus we only need to model the boundary cases of the negative distribution. 
Studying these boundary cases for DPM model of cow revealed that most of negative sam¬ 
ples come from "sheep" and "horse". Initializing GDPM negative parts of cow with a part 
from similar objects increased the performance from 21.9 to 26.7. 
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Abyssinian 

Bengal 

Birman 

Bombay 

British 

Shorthair 

Egyptian Maine 

Mau Coon 

Persian 

Ragdoll 

Russian 

Blue 

Siamese 

Sphynx 

dpm 4 

21.3 

12.8 

34.5 

23.3 

32.2 

15.8 

21.6 

28.0 

19.4 

24.0 

29.4 

22.7 

DPMg 

22.2 

12.8 

31.4 

21.5 

31.3 

16.5 

26.0 

29.0 

20.6 

25.0 

30.9 

22.0 

GDPM\ 

24.3 

11.9 

38.4 

23.9 

31.0 

19.7 

27.7 

29.7 

24.5 

29.9 

35.9 

27.4 

gdpmI 

25.5 

13.7 

34.0 

23.0 

33.5 

20.9 

24.5 

30.0 

21.9 

25.3 

30.6 

23.2 


Table 1: Cat Head Detection: Results comparing DPM and GDPM with different number of positive and negative parts. Sub (super) 
indices show the number of positive (negative) parts. GDPM consistently outperforms DPM variants. The average cat head images are taken from [O]. 


Negative Parts. Although, adding negative parts helps in general, it seems that, unless the 
parts are carefully initialized, the discovered negative parts with default initialization are not 
visually meaningful. We analyzed the false positives (FP) of DPM and GDPM detectors sim¬ 
ilar to [B] and observed that some of the FP reduction in GDPM is from non similar object 
classes or background. However, when we initialized the negative parts using other classes, 
the ratio of FP due to similar objects decreased more significantly (See Figure 2b). 

Cat Head. The second task is detection of heads for 12 breeds of cats. The cat head images 
are taken from Oxford Pets dataset [O]. For each cat breed the distractor set contains full 
images (including background clutter) of the other 11 breeds and 25 breeds of dogs. Each 
cat breed has around 65 positive images and 3500 negative images for training. Test set 
has the same number of images. This task is particularly hard due to the low number of 
positive training images and high level of similarity between cats heads which is sometimes 
hard for humans to distinguish. For sample images refer to Section 5 in supplementary 
material. Table 1 reports the results by comparing DPM and GDPM using various number 
of positive/negative parts. The aligned average images in [HE] is used as header in Table 1. 
Since the number of positive images in the training set is low, we used one component per 
detector. It can be seen that GDPM consistently outperform the DPM using different number 
of parts. For a few classes ( e.g . Bengal) DPM with 4 positive parts outperform GDPM with 2 
positive and 2 negative parts, but as we increase the number of parts to 6, replacing 2 positive 
parts with 2 negative parts proves beneficial. We observed that adding more parts (more than 
6) degrades the results due to over-fitting to the small number of positive images. Negative 
parts seems to be more robust for the classes where going from total of 4 parts to 6 parts 
show signs of over-fitting (e.g. Birman, Sphynx). 

8 Conclusion 

In this work we proposed a new framework for discriminative latent variable models by 
explicitly modeling counter evidences associated with negative latent variables along with the 
positive evidences and by that emphasized the role of negative data. We further described a 
general learning framework and showcased how common computer vision latent models can 
benefit from the new perspective. Finally, we experimented on two tasks of object detection 
revealing the benefits of negative latent variables. We further observed that expressive power 
of GLVM crucially depends on a careful initialization. 

We think the proposed GLVM framework yields to fruitful discussions and works regarding 
various latent variable models. 


Acknowledgement. We’d like to thank Pedro Felzenszwalb for the fruitful discussions. 
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Supplementary Material 


1 Intuitive example of LVM and GLYM 


Figure 1 gives intuitive examples of positive and negative latent variables. 



Object Location As 
Latent Variable 


Object Parts and Subparts As 
Latent Variable 


Context As 
Latent Variable 


LVM 





Figure 1: Intuitive example of LVM and GLVM. Assume the task of cow image class- 
fication. This figure visualizes examples of possible latent variables for LVM and GLVM 
models. A latent variable can be assigned at image level such as the background scene and 
then gradually go down a hierarchy of objects, parts and subparts. Positive latent variables 
(modeling evidences for cow) of an LVM may correspond to different scenes where cow is 
likely to be seen e.g. stable, meadow (blue boxes in the first column). Another positive latent 
variable can be assigned to the location of the cow in the image (second column, first row). 
In addition, location of different parts of cow e.g. head, back, legs can be modeled as other 
latent variables dependent to the latent variable of cow location (top right). All these exam¬ 
ples of latent variables in LVMs look for evidences for the existence of cow. In GLVM we 
encourage the modeling of counter evidences through negative latent variables. For example 
at high level a negative latent variable may correspond to scenes where cow is unlikely to 
be found e.g. office, laundromat, or airport (bottom left). Meadow is an evidence for cow, 
but cows are less likely to be seen in a meadow beside camping tents (a dependent negative 
latent variable to positive latent variable meadow). Horses are visually similar to cow but 
people usually do not ride cows thus a human on top of cow can be negative evidence (a neg¬ 
ative latent variable dependent to the positive latent variable of cow location). Furthermore, 
human, a negative latent variable for cow, can have its own positive latent variable as parts 
e.g. face and shoulders (bottom right). 


©2015. The copyright of this document resides with its authors. 

It may be distributed unchanged freely in print or electronic forms. 
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2 Dependency Structure 

Complexity of evaluating GLVM scoring function is governed by the dependency structure 
of the underlying feature function (j). In the most extreme case, with no independence as¬ 
sumption at all, evaluating the scoring function would require exhaustive search on the space 
of all possible assignments of the latent variables. This grows exponentially with the number 
of latent variables and becomes intractable very quickly. 

Tree-structured dependencies are particularly interesting and widely used in practice [1,4]. 
Complexity of evaluating the scoring function with tree-structured dependency is quadratic 
in the cardinality of the domain of a single latent variable. 

Dependency structure of the feature function also affects the complexity of finding the fixed 
latent assignments in inference functions of training, and hence, the construction of the 
bounds B t . 

Another note is that although we call a^s positive latent variables and s negative latent 
variables, this is only for simplicity of notation and explanations. That is, the true meaning 
of these variables as evidences or counter evidences for the positive class changes by the de¬ 
pendency structure of the scoring function. For example a negative latent variable depending 
on another negative latent variable of the positive class is not a counter evidence for positive 
class nor is a evidence. Conversely, positive latent variables dependent to a negative latent 
variable act as a counter evidence for the positive class. 


3 Algorithm 

Details of the training framework is given in Section 4 of the main paper. Here is the algo¬ 
rithm corresponding to the training procedure of GLVM. The equation numbers refer to the 
main paper. 


Algorithm 1 Training GLVM classifiers. Note that equation numbers refer to the main paper. 

Input: D = {{xi,yi)}? =v wo 
t 4— 0 

repeat 

Construct S t (w) from Equation (11) and w t 
Construct S t (w) from Equation (12) and w t 
Construct B t (w) from Equation (8) 
w t+ i 4— argmin w B t (w) 
t 4— t -\-1 

until w t does not change 

Output: w t 


4 Deformable Part Models 

Part based models have had a long and successful presence in the history of Computer Vision 
[2, 3,4, 5,9]. 

Deformable Part Models (DPMs) [4] have been extensively used for object detection 
over the past decade. Despite their incredible flexibility to object deformation and viewpoint 
changes they only look for positive evidence in images. In particular, DPMs do not capture 
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counter evidence. For example, the knowledge that a cow-head detector fires strongly on a 
detection window should decrease the score of horse-detector on that window. We propose an 
extension to DPMs by augmenting them with negative parts. We call our model Generalized- 
DPM (or GDPM in short). 

Training of the proposed model does not fit in the latent S VM framework [4] . We propose 
an algorithm to learn DPMs with negative parts. Our experimental results show that negative 
parts complement DPMs [4]. DPM is a mixture of multi-scale part-based object detectors. 
Each mixture component captures appearance of the object in a specific viewpoint. Each 
mixture has a root filter for capturing the global structure of the object in a coarse scale. 
It also has a set of deformable part filters for capturing parts of the object individually in 
a higher resolution. DPM uses sliding window search to detect objects. The root filter is 
used as the detection window during training. Consider a DPM with n parts. We denote the 
location (and size) of the root filter and the parts by r and / = {l\ ,..., /„) respectively. We also 
denote the choice of the mixture component by c. The true root, and part locations as well 
as the choice of the mixture component are unknown even during training and, therefore, are 
treated as latent variables in a DPM. 

We define a latent configuration as an assignment of all latent variables in the model and 
denote it by z = (c, r, /). Score of a DPM with latent configuration z on image v is a linear 
function of the model parameters, which we denote by /3, and is defined as follows: 

gp(x,z) = P-®(x,z) (1) 

where d>(v,z) extracts features from image x according to the latent configuration z. 

Let R be a set of candidate root locations (for examples, all detection windows that have 
high overlap with a given image region). We denote the detection score on R by fp (x,R) and 
define it as the score of the best latent configuration that places the root filter at a location in 
R as follows: 


fp(x,R) 


max gp(x,z) 
z={r,c,l) 
s.t. r ER 


( 2 ) 


4.1 Training Deformable Part Models 

Let V = bi)}^ denote a set of A f annotated training examples. Xi denotes an image. 

yi G {+1, — 1} denotes a class label that determines whether the example is foreground (y z - = 
+ 1) or background (y t = — 1). bi denotes the coordinates of the annotated bounding box that 
circumscribes the object in a foreground image, bi is irrelevant if y z = — 1. All root filter 
locations on a background image are considered as independent negative examples. 

DPM uses the root filter as a detection window during training. Despite availability of 
bf 5 s for foreground images, DPM treats true detection bounding boxes on the foreground 
images as latent variables. Any root filter whose overlap with the annotated bounding box 
bi exceeds a certain threshold T is considered as a valid detection (true positive) for bi. We 
denote the set of valid detections of bi by H(xi,bi, t). In order to train DPMs a separate 
training set D = {(xi,yi,Ri)}f =1 is constructed from the annotated examples in V. Each 
background example (x,y = -l,^j E D adds multiple examples (x,y,{r}) to D, one for 
each root placement r. Each foreground example (x,y = +1 ,b() E V adds one example 
(x,y,R) to D where R = H(x,b, t). In DPM, the value of T is set to 0.7. 
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DPM uses Latent SVM [4] for training. Given the training set D, the model parameters 
P are trained by minimizing the following objective function: 

0(P) = hm 2 +cf j max{0,1 - yj p *,-)} (3) 

i= 1 

The objective function in Equation 3 is non-convex. Particularly, fp(xf,Ri) as defined in 
Equation 2 is maximum of a set of linear functions gp(x,z), and therefore is convex. Then 
for samples with y>0 the hinge loss in Equation 3 becomes concave. 

Latent SVM employs an alternating approach similar to CCCP [8] to minimize the training 
objective. CCCP is an iterative procedure that alternates between 1) constructing a con¬ 
vex upper bound on the original objective function and 2) optimizing the bound. CCCP is 
guaranteed to converge to a local minimum of the original objective function. 

In the case of DPMs, the first step boils down to fixing the value of the latent variables in 
Equation 3 for foreground examples. Let Zi = .. .,/*>) denote a particular assign¬ 

ment of the latent variables for the foreground example (x/.y,-./?,). A convex upper bound to 
O(p) is defined as follows: 


S(/3)=Lj 6|| 2 + C £ max{0,l +f p (x i ,R i )} 

i= 1 

yi =- 1 

N 

+ C £ max{0,l -gp(xi,zt)} (4) 

i= 1 

ypM-i 

We call this the relabeling step of the training algorithm. 

The second step involves solving the following optimization problem: 

j3* = argmin#(/3) (5) 

P 

Any assignment of zi s for foreground examples leads to a convex function (Equation 4) 
which is an upper bound to O (Equation 3). In order to guarantee progress in each iteration 
we require that z\ s are set to the maximum scoring configurations under the previous model 
P* as follows: 


Zi = argma xgp*(xi,z), Vi s.t. >’, = +1 

Z 


( 6 ) 


4.2 Negative Parts 

In a binary classifier, a negative (positive) part is defined as a part whose maximum response 
is subtracted from (added to) the classification score [6] . 

While positive parts collect evidence for presence of an object negative parts collect 
counter-evidence for it. For example, consider a detection window CO. Presence of a saddle 
in the detection window CO should decrease the score of a cow detector on CO. This is because 
cows tend not to have saddles on their backs. Thus, if we find a saddle in the detection 
window of a cow detector we, most likely, are confusing a horse for a cow. Counter evidence 
is particularly important in discriminating between similar objects; like cow from horse in 
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the hypothetical example above. We obtain this behavior by augmenting cow detector model 
with a negative part that fires on saddle. 

Finally, we note that standard DPMs cannot capture counter-evidence. For example, if F 
is the appearance model of a positive part, — F does not make it a negative part. In this paper 
we extend DPMs by augmenting them with negative parts. 


4.3 Adding Negative Parts to Deformable Part Models 

In DPM, responses of all part filters are added to the detection score of the classifier. Thus, 
DPMs cannot model negative parts. We generalize DPMs by augmenting them with negative 
parts. We call the proposed model Generalized-DPM (or GDPM in short). Standard DPM [4] 
is a GDPM with zero negative parts. 

We denote a GDPM with k mixture components by y = (yi. Each mixture component 
captures the appearance of the object in a viewpoint. A mixture component with n positive 
parts and m negative parts is parameterized by y c = (F c? o; w + x ;.. •; w ~ 1 ;... ;w“ m ). F c $ 

denotes the root filter, wj~’s denote the positive part models, and wj ’s denote the negative 
part models. Each part model w c j = (F c j;d c j) has an appearance F c j and a deformation 
parameter vector d c j. 

Let / + = (/^,..., /+) denote the positive part placements and /“ = (/j~ ) denote the 

negative part placements. We denote a latent configuration in a GDPM by z= (r,c,/ + ,/ - ). 
Score of a GDPM y with latent configuration z on image x is a linear function of the model 
parameters y. We denote this by gy(x,z) and define it as follows: 


g r ( x ,z) = y-®(x,z) 


(7) 


where <f> extracts features from image x according to the latent configuration z. This is 
very similar to the score function of the standard DPM (see Equation 1). For example, 
both functions are linear in the model parameters. Detection score of placing the root at r, 
however, is fundamentally different for the two models. For example, unlike Equation 2 we 
cannot compute fy(x,R) by maximizing over all the latent variables. We will elaborate more 
on the differences between fp (x,R) and fy(x,R) in the rest of this section. 

Like in DPM, part dependencies in GDPM respect a star configuration with the root filter 
at the center of the star. Thus, part placements are independent given the location of the root 
filter. Let (/) a (x,lj) denote appearance features extracted from patch lj of image x (e.g. HOG 
features). Also let 0j(/j, r) denote the deformation features of placing part j at lj given that 
the root filter is placed at r. We define the feature function <j>(x, lj , r) = (0 a (x, lj); <$>d{lj , r ))• 
For DPMs, we can rewrite Equation 2 as follows: 



n 


+ Y, max w c j ■ <j)(x.J h r 


(8) 
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In contrast, for GDPMs we have: 

k 

fy{x,R) = maxmax/v 7 o • r) 

reR c= 1 

n 

+ £maxw+.-0(x,/t, r ) 

J= 1 1 
m 

- £ max • 4 > (x, Ij , r) (9) 

7=1 0 

One way to see that GDPM is more general than DPM is to note that the score function 
of Equation 8 is convex in the model parameters whereas Equation 9 is not. 


4.4 Training with Negative Parts 

We define the training objective of a GDPM as follows: 

1 N 

C>(y)=d|y|| 2 + CL max{0,1 - (10) 

Z i= 1 

This objective function is non-convex. Moreover, unlike DPMs, fixing the latent variables 
for the foreground examples does not make it convex. Thus, we cannot use Latent SVM 
framework to train GDPMs. 

We deploy CCCP for training the objective function of Equation 10. Recall that CCCP 
alternates between constructing a convex function that upper bounds the original objective 
function and optimizing the bound. 

We obtain a convex upper bound B(y) to the training objective by replacing fy(xi,Ri) 
with a convex upper bound fy for background images and a concave lower bound fy for 
foreground images. 


B{i)=\\\y\\ 2 + C £ max{0,l+/ r (x;,fl;)} 

Z i= 1 

^=-1 

N 

+ C Y, max{0,1 -f r (xt,Ri)} (11) 

1=1 

yi =+1 


In the rest of this section we first explain how the bounds are constructed and then elaborate 
on the optimization of the bound. 

4.5 Constructing Convex Bounds 

In this section we explain how we obtain fy and fy in order to construct the bound B(y) in 
Equation 11. Let y* be the solution to the previous convex bound. We define /^-(r,c) = 
argmax; /, r) to be the optimal placement of the j-th negative part using the pre¬ 

vious model conditioned on the choice of the root location r and the mixture component c. 
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We obtain f y by fixing the placement of the negative parts according to /■ ^(r, c ) as follows: 

v 

fy(xi,Ri) = maxmaxF c0 • 0(*;,r) 

reR c=l 

n 

+ Y maxw+ r <j)(xi,l+,r) 

J = 1 'j 
m 

- L w ^7 ■H x hYj( r > c )> r ) ( 12 ) 

7=1 

Let (r z , Cj, 1 +) denote the best placement of the root filter, choice of the mixture component, 
and placement of the positive parts using the model y* given that the negative parts have 
been placed optimally according to /r~-(r,c). We can formally define this as follows: 

(n-Ci.lt) = argmaxF c * 0 • (j) r) 

(r,c,l) 

+ XAcj'0( x ‘’ / /’ r ) 

7=1 

m 

- L w*~j • (13) 

7=1 

We obtain f y as follows: 

f y {Xi,Ri) = Fa ,0 ‘ ^i) 

+ L W qJ-^( Xl ’ Z U’ r ') 

7=1 

m 

-^ma xw~j-<l>(xi,lj,ri) (14) 

7=1 '/ 


4.6 Initializing Negative Parts 

The positive parts in original DPM [4] is initialized by first learning a root filter (whole object 
detector without parts). Then in order to initialize the location and weights of a part filter, a 
submatrix of root filter is sought which has the maximal positive energy. Positive energy is 
measured by sum of squared positive weights in the submatrix of the root filter. The first part 
is initialized with the best submatrix location and weights. Then all the weights are zeroed 
at that location and the procedure is repeated to initialize a second part. This continues until 
the desired number of parts are added. We initialize positive parts in the same manner. Also, 
we initialize the negative part similarly, except that we try to find the submatrix with highest 
negative energy being sum of squared negative weights of the submatrix. This involves the 
main results of the paper. 

We further initialize the negative parts with the positive parts of similar classes. Particularly, 
we initialize the negative parts for cow GDPM model by taking positive parts from horse 
DPM model. 
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4.7 Inference for Data Mining 

When doing inference for GDPM data mining, we need to construct a concave lower bound 
for positive samples and convex upper bound for negative samples. That is, we need to fix 
the position of negative parts for negative samples at the location of relabeling. We further 
need to fix the location of positive parts, mixture component and root filter location for pos¬ 
itive samples to the assignments at the relabeling. On the other hand, the positve parts and 
mixture component needs to be maximized for negative samples and negative parts needs 
to be updated for positive samples. Storing the fixed values of latent variables for all nega¬ 
tive samples (hundreds of thousands negative bounding box for each image) is prohibitively 
memory-consuming. Thus, in our implementation, we do the following. At each relabeling 
step we keep a copy of the model. We also store the inferred component and location of 
the root filter only for positive samples (requires low amount of memory since number of 
positive annotations are usually low). Then at each iteration of data mining: For positive 
samples we take the current model and roll back the positive parts to the copied model, and 
do the inference at the saved location and with the saved component. As for negative ex¬ 
amples, we take the current model and roll back the negative parts to the copied model and 
do the inference. This way we simply take care of convex upper bound and concave lower 
bounds with low memory footprint. However, this comes at the cost of two times more com¬ 
putational complexity. Since the main complexity of the inferences goes back to convolution 
of part/root filters, the remedy for this would be to save the output of convolution for posi¬ 
tive parts on positive examples and negative parts on negative examples (proportional to the 
number of images and independent of number of negative bounding boxes per image). 
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5 Dataset 

Random images from Cat Head dataset [7]. Note that, many of the cat heads are not frontal. 



(a) Abyssinian 



(c) Birman 




(b) Bengal 





(d) Bombay 


(f) Egyptian Mau 



(j) Russian Blue 



(k) Siamese 


(1) Sphynx 


Figure 2: Samples of Pet [7] dataset used for cat head detection 
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6 Precision Recall Curves 

In this section, the precision recall curves for DPM comparing with its best performing 
GDPM variant is plotted for animals subset of PASCAL VOC 2007. For the class of cow 
an extra curve is plotted which belongs to a different initialization of negative parts using 
horse DPM model. For the class of bird and dog while the AP (calculated by average of 
11 sampled points) is equal for GDPM and DPM, the GDPM curve shows slightly better 
precision at lower recalls. 



Precision 


(a) PR curve for cow in PASCAL VOC 2007 
using original DPM, GDPM with 2 negative 
parts, and GDPM with one negative part ini¬ 
tialized by horse DPM model. 



(c) PR curve for dog in PASCAL VOC 2007 
using original DPM, GDPM with 1 negative 
part. Although the AP (computed by taking 
average of 10 sampled points from the curve) 
is equal, the GDPM curve has slightly higher 
precision at lower recall. 



Precision 


(b) PR curve for bird in PASCAL VOC 2007 
using original DPM, GDPM with 1 negative 
part. Although the AP (computed by taking 
average of 10 sampled points from the curve) 
is equal, the GDPM curve has slightly higher 
precision at lower recall. 



Precision 


(d) PR curve for sheep in PASCAL VOC 2007 
using original DPM, GDPM with 2 negative 
part. The GDPM curve exhibits higher preci¬ 
sion at different recalls. 


Figure 3: PR curves for animals subset of PASCAL VOC 2007 
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